Introduction

The purpose of this report is to document the quality-control and data-transformation procedures followed to obtain methylation values (\(\beta\) and M) for the adolescent vaping study. Much of the code for this report had already been produced by the previous analyst, Cuining Liu. Further steps were taken to dissect the data pipeline and to support any decisions made moving forward.

Methods

Methylation samples were taken using the 850K ‘EPIC’ array. All quality control steps were conducted using the package SeSAMe ver. 1.15.7. The steps below are not necessarily presented in the chronological order in which they were taken. The order of pre-processing steps follows the openSesame data pipeline (Zhou et al. 2018).

Sample Removal

Samples will be evaluated for low mean signal intensity to identify potential outliers with low-quality methylation values. Outliers will be defined as samples in the bottom 1% of mean signal intensity.
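The bottom-1% rule above can be sketched as follows. This is a minimal illustration in Python rather than the SeSAMe code used for the report; the function name `flag_low_intensity`, the sample IDs, and the simulated intensities are all hypothetical, and the percentile is assumed to be computed with linear interpolation (the NumPy default).

```python
import numpy as np

def flag_low_intensity(mean_intensity, sample_ids, pct=1.0):
    """Return IDs of samples whose mean intensity is at or below the pct-th percentile."""
    values = np.asarray(mean_intensity, dtype=float)
    cutoff = np.percentile(values, pct)  # linear interpolation between order statistics
    return [sid for sid, v in zip(sample_ids, values) if v <= cutoff]

# Simulated (not real) mean intensities for 48 hypothetical samples.
rng = np.random.default_rng(0)
ids = [f"S{i:02d}" for i in range(48)]
intensity = rng.normal(5000, 300, size=48)
intensity[10] = 2000.0  # one artificially dim sample

flagged = flag_low_intensity(intensity, ids)
print(flagged)
```

With 48 samples, the 1st percentile falls between the lowest and second-lowest values, so this rule always flags exactly one sample unless there are ties. The flagged sample is therefore a candidate for review rather than evidence of a quality problem by itself.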

Probe Masking

The SeSAMe package uses probe ‘masking’ instead of probe removal. The two terms are used interchangeably in this report; however, it should be noted that ‘masked’ probes are not actually removed from the dataset. They are only ignored by SeSAMe when completing downstream analyses. Probes are masked in two ways:

  1. Experiment-independent Probe Masking: Probes that are masked due to non-unique mapping or influence by SNPs.
  2. Experiment-dependent Probe Masking: Probes that are masked based on the detection p-value, which represents the probability of a detection signal being background fluorescence (Zhou et al. 2018).

Experiment-independent probe masking is set by a pre-determined list of probes specific to the ‘EPIC’ array. Experiment-dependent probe masking will be determined by a p-value threshold. CpG probes with a detection p-value > 0.05 in at least 10% of samples will be masked for the purpose of this study.
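The experiment-dependent rule can be illustrated with a short sketch. It is written in Python for illustration only and is not the SeSAMe implementation; the function name `detection_mask`, its parameter names, and the toy p-value matrix are assumptions. The input is assumed to be a probes-by-samples matrix of detection p-values.

```python
import numpy as np

def detection_mask(pvals, p_threshold=0.05, sample_fraction=0.10):
    """Return a boolean vector (True = mask the probe).

    pvals is a probes x samples array of detection p-values. A probe is
    masked when its p-value exceeds p_threshold in at least
    sample_fraction of the samples.
    """
    pvals = np.asarray(pvals, dtype=float)
    failed = pvals > p_threshold          # per-probe, per-sample failures
    frac_failed = failed.mean(axis=1)     # fraction of failing samples per probe
    return frac_failed >= sample_fraction

# Toy matrix: 3 probes x 20 samples, all well detected to start with.
pvals = np.full((3, 20), 0.001)
pvals[1, 0] = 0.2     # probe 1 fails in 1 of 20 samples (5%): kept
pvals[2, :3] = 0.5    # probe 2 fails in 3 of 20 samples (15%): masked

mask = detection_mask(pvals)
print(mask)
```

The returned mask only marks probes; as described above, masked probes stay in the dataset and are simply ignored downstream.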

Data Transformations

There are two data transformations within the SeSAMe pipeline. The first is a dye bias correction. Dye bias refers to bias in the methylation values caused by differences in the performance of the red and green dyes, whose fluorescence is interpreted as methylation signal. There are several corrections that can be applied to make the red and green values more comparable. SeSAMe specifically implements a non-linear dye bias correction.

The next transformation within the openSesame pipeline is background subtraction. The purpose of background subtraction is to align the distribution of beta values for Infinium I and II probes in order to make them more comparable. The default method for SeSAMe is a normal-exponential deconvolution using out-of-band probes, or ‘noob’ (Triche et al. 2013).

Beta and M-Value Distributions

The product of the SeSAMe data pipeline is a matrix of beta values. Beta values will be converted to M-values using the logit transformation:

\[M = \log_2\left(\frac{\beta}{1 - \beta}\right)\]

Visualizations will ensure the proper distribution of \(\beta\) and M-values, respectively.
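As a worked illustration of the logit transformation above, the sketch below converts a few beta values to M-values and back. It is written in Python for illustration; the helper names `beta_to_m` and `m_to_beta` are hypothetical and are not part of SeSAMe or minfi.

```python
import numpy as np

def beta_to_m(beta):
    """Convert beta values in (0, 1) to M-values: M = log2(beta / (1 - beta))."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse transformation: beta = 2^M / (2^M + 1)."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

beta = np.array([0.2, 0.5, 0.8])
m = beta_to_m(beta)
print(m)             # approximately -2, 0, 2
print(m_to_beta(m))  # recovers the original beta values
```

A beta value of 0.5 maps to M = 0, and the transformation is symmetric around that point. Beta values of exactly 0 or 1 map to infinite M-values, so they would need to be bounded away from the limits before conversion.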

Sample Clustering

Visualizations of clustering by sex, recruitment center, and vape status will help to detect any technical effects that need to be accounted for in downstream analyses.

Results

Of the 51 subjects surveyed for this project, methylation data were collected for 48. Three subjects had no transcript data (SIDs 105, 137, and NA), and two of them (SIDs 105 and 137) also had no methylation data. One sample (SID 102), included in the RNASeq analysis as “Sample 12”, also had missing methylation data. This sample was the focus of a sensitivity analysis for inclusion in the RNASeq analyses (see “sample12_sensitivity_report_2022_mm_dd.html”), and it should be noted that this subject will not be available for comparison. One sample (SID 144) lacked vape status in the clinical metadata; it was removed only after beta values had been obtained for all subjects, so that it could still contribute to normalization. This sample was also excluded from the RNASeq analysis. Downstream analyses will include 47 samples.

Subject ID    RNASeq ID    Methylation ID
102           Sample12     No Data
105           No Data      No Data
137           No Data      No Data

Outlier Samples

SID 127 falls in the bottom 1% of mean signal intensity readings.

Figure 1: Sample Quality Outliers

Considering the small sample size in this experiment, it is advisable not to remove SID 127, but the sample should be kept in mind for downstream analyses.

Probe Masking

Experiment-independent probe masking masked 105,454 CpG sites, and experiment-dependent probe masking (detection p-value > 0.05 in \(\ge\) 10% of samples) masked a further 14,349. In total, 119,803 probes were masked.

Data Transformations

Figure 2 demonstrates the non-linear dye bias correction for a small subset of samples.

Figure 2: Dye Bias Correction

Overall, the correction worked as expected, but it should be noted that there is still a slight bias towards the green signal for higher-intensity values.

Figure 3 demonstrates the ‘noob’ method of background subtraction for a single sample.

Figure 3: Sample Beta Value Distribution for Infinium I and II Probes

‘noob’ background subtraction shifted the distributions of probes to have modes closer to 0 and 1 (the expected distribution of beta values). Additionally, it appears that some noise was removed from the unmethylated beta values.

Beta and M-Value Distributions

Overall, \(\beta\) and M-values followed their expected distributions.

Figure 4: \(\beta\) and M-Value Distributions

Figure 4 indicates some noise around \(\beta\) = 0.5 or \(M\) = 0. This noise could indicate low-quality probes that were not masked by experiment-dependent probe masking. Noise may be reduced by applying a more stringent detection p-value threshold (e.g. masking probes with p > 0.01 rather than p > 0.05).

Clustering by Sex

Visualization of the samples by median X and Y chromosome intensities helps to identify samples with poor quality or samples whose clinical sex does not match the sex predicted from these values. Figure 5 displays plots of median intensity for the X and Y chromosomes, color-coded for both clinical and predicted sex.

Figure 5: Sex Verification

Figure 5 demonstrates sample clustering by sex for median X and Y intensities. Clinical sex matches predicted sex for all samples. It should be noted that SID 122 recorded ‘non-binary’ as clinical sex, and sex was inferred from methylation data for other portions of this project.

Sample Clustering

MDS plots were made with the package minfi ver. 1.43.0 from \(\beta\) and M-values preprocessed with the SeSAMe pipeline. For each feature of interest, plots were made a) using probes from the whole genome and b) using only autosomal probes. Using only autosomal probes removes the inherent separation by sex and gives a better idea of other clustering patterns.
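For readers unfamiliar with MDS, the sketch below shows classical (metric) MDS on a probes-by-samples matrix. It is an illustrative Python implementation, not the minfi code used to produce the figures. It assumes Euclidean distances between samples computed on the most variable probes; the function name `classical_mds`, the `n_top` parameter, and the simulated data are hypothetical.

```python
import numpy as np

def classical_mds(values, n_top=1000, n_dims=2):
    """Classical MDS of samples from a probes x samples matrix.

    Keeps the n_top most variable probes, computes squared Euclidean
    distances between samples, double-centers them, and returns the
    leading n_dims coordinates (samples x n_dims).
    """
    values = np.asarray(values, dtype=float)
    top = np.argsort(values.var(axis=1))[::-1][:n_top]  # most variable probes
    x = values[top].T                                   # samples x probes
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)    # squared distances
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering               # double centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]          # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Simulated (not real) beta values: 500 probes, 12 samples in two groups.
rng = np.random.default_rng(1)
beta = rng.uniform(0.45, 0.55, size=(500, 12))
beta[:50, 6:] += 0.3   # second group differs at 50 probes

coords = classical_mds(beta)
print(coords[:, 0])    # first MDS coordinate for each sample
```

In this simulated example the two groups differ strongly at a subset of probes, so they separate along the first coordinate. This mirrors how sex-linked probes drive the separation by sex in whole-genome plots, and why restricting to autosomal probes removes it.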

Figure 6: Sample Clustering (Sex)

Whole Genome

Autosomes Only

Figure 7: Sample Clustering (Vape Status)

Whole Genome

Autosomes Only

Figure 8: Sample Clustering (Recruitment Center)

Whole Genome

Autosomes Only

Conclusions

The review of sample quality and clustering patterns identified one subject (SID 127) that fell in the 1st percentile for mean intensity. There is also a clear sex effect in the data that will be accounted for by including sex as a model covariate.

References

Triche, Timothy J., Daniel J. Weisenberger, David Van Den Berg, Peter W. Laird, and Kimberly D. Siegmund. 2013. “Low-Level Processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7): e90–90. https://doi.org/10.1093/nar/gkt090.
Zhou, Wanding, Timothy J Triche, Peter W Laird, and Hui Shen. 2018. “SeSAMe: Reducing Artifactual Detection of DNA Methylation by Infinium BeadChips in Genomic Deletions.” Nucleic Acids Research, July. https://doi.org/10.1093/nar/gky691.